ROC Curve (Receiver Operating Characteristic)#

The ROC curve visualizes the tradeoff between:

  • True Positive Rate (TPR / recall / sensitivity)

  • False Positive Rate (FPR = 1 - specificity)

as we sweep a decision threshold over a model’s scores (probabilities, logits, or any ranking score).


Learning goals#

By the end you should be able to:

  • define TPR/FPR from the confusion matrix

  • compute ROC points by threshold-sweeping

  • implement roc_curve and AUC from scratch (NumPy)

  • pick an operating threshold with ROC constraints (e.g. “FPR ≤ 5%”)

  • use AUC-ROC to pick a hyperparameter for logistic regression

Quick import (scikit-learn)#

import os
import sys

import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

# Reproducibility: record versions (nice to have in a knowledge base)
print("python:", sys.version.split()[0])
print("numpy :", np.__version__)
print("plotly:", plotly.__version__)

# Guard the sklearn import so the NumPy-only parts below still run without it.
try:
    import sklearn
    from sklearn.metrics import roc_curve, roc_auc_score  # noqa: F401

    print("sklearn:", sklearn.__version__)
except Exception as e:
    print("sklearn: not available ->", repr(e))
python: 3.12.9
numpy : 1.26.2
plotly: 6.5.2
sklearn: 1.6.0

1) From scores to decisions#

Let \(y_i \in \{0,1\}\) be the true label and \(s_i \in \mathbb{R}\) be a score where larger means more likely positive.

A threshold \(t\) turns scores into hard predictions:

\[ \hat{y}_i(t) = \mathbb{1}[s_i \ge t] \]

This creates the confusion-matrix counts:

\[\begin{split} \begin{aligned} \mathrm{TP}(t) &= \sum_i \mathbb{1}[y_i=1 \land s_i \ge t] \\ \mathrm{FP}(t) &= \sum_i \mathbb{1}[y_i=0 \land s_i \ge t] \\ \mathrm{TN}(t) &= \sum_i \mathbb{1}[y_i=0 \land s_i < t] \\ \mathrm{FN}(t) &= \sum_i \mathbb{1}[y_i=1 \land s_i < t] \end{aligned} \end{split}\]

Two key rates (both in \([0,1]\)):

\[ \mathrm{TPR}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t)+\mathrm{FN}(t)} \qquad \mathrm{FPR}(t) = \frac{\mathrm{FP}(t)}{\mathrm{FP}(t)+\mathrm{TN}(t)} \]
  • \(\mathrm{TPR}(t)\) is sensitivity / recall: \(P(\hat{y}=1\mid y=1)\)

  • \(\mathrm{FPR}(t)\) is \(1-\text{specificity}\): \(P(\hat{y}=1\mid y=0)\)

ROC curve: the set of points \((\mathrm{FPR}(t),\mathrm{TPR}(t))\) as we sweep \(t\) from \(+\infty\) down to \(-\infty\).

  • At \(t=+\infty\): predict everything negative → \((0,0)\)

  • At \(t=-\infty\): predict everything positive → \((1,1)\)

# Toy scores: positives tend to have higher scores, but overlap with negatives.
n_pos, n_neg = 250, 350
scores_pos = rng.normal(loc=1.2, scale=1.0, size=n_pos)
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)

y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score = np.r_[scores_pos, scores_neg]

threshold_example = 0.5

fig = go.Figure()
fig.add_histogram(
    x=scores_neg,
    name="y=0 (negative)",
    nbinsx=50,
    opacity=0.6,
    histnorm="probability density",
)
fig.add_histogram(
    x=scores_pos,
    name="y=1 (positive)",
    nbinsx=50,
    opacity=0.6,
    histnorm="probability density",
)
fig.add_vline(x=threshold_example, line_width=2, line_dash="dash", line_color="black")
fig.update_layout(
    barmode="overlay",
    title="Toy scores: two overlapping distributions (a threshold splits predictions)",
    xaxis_title="score s (higher ⇒ more positive)",
    yaxis_title="density",
)
fig.show()

def confusion_counts(y_true, y_score, threshold):
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)

    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return int(tp), int(fp), int(tn), int(fn)


tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold_example)
tpr_example = tp / (tp + fn)
fpr_example = fp / (fp + tn)

print(f"threshold t = {threshold_example:.2f}")
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"TPR={tpr_example:.3f}, FPR={fpr_example:.3f}")
threshold t = 0.50
TP=186, FP=99, TN=251, FN=64
TPR=0.744, FPR=0.283
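
A quick check of the endpoint claims above: an extreme threshold predicts everything one way, so the ROC must start at \((0,0)\) and end at \((1,1)\). A minimal sketch reusing confusion_counts:

for t in (np.inf, -np.inf):
    tp, fp, tn, fn = confusion_counts(y_true, y_score, t)
    print(f"t={t}: TPR={tp / (tp + fn):.1f}, FPR={fp / (fp + tn):.1f}")
# t=inf  -> TPR=0.0, FPR=0.0  (everything predicted negative)
# t=-inf -> TPR=1.0, FPR=1.0  (everything predicted positive)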

2) Computing the ROC curve (NumPy)#

A naive ROC implementation picks a grid of thresholds and recomputes the confusion matrix for each one. That works, but it costs \(O(nT)\) for \(T\) thresholds, and a fixed grid can miss operating points that fall between grid values.
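
For contrast, the naive version might look like this (a sketch reusing confusion_counts from section 1; roc_points_naive is a local helper, not a library function):

def roc_points_naive(y_true, y_score, thresholds):
    """Recompute the confusion matrix at each threshold: O(len(thresholds) * n)."""
    points = []
    for t in thresholds:
        tp, fp, tn, fn = confusion_counts(y_true, y_score, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(points)

# e.g. roc_points_naive(y_true, y_score, np.linspace(-4, 4, 201))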

A standard efficient approach:

  1. Sort samples by score \(s\) (descending).

  2. Sweep the threshold from high to low.

  3. Each time the threshold crosses a score value, that sample flips from predicted negative → positive.

  4. Cumulative sums give \(\mathrm{TP}(t)\) and \(\mathrm{FP}(t)\) at each unique score.

This is \(O(n\log n)\) for the sort, then \(O(n)\) for the sweep.

def roc_curve_numpy(y_true, y_score):
    """Compute ROC curve points (FPR, TPR) and thresholds.

    Parameters
    ----------
    y_true : array-like, shape (n_samples,)
        Binary labels {0,1}.
    y_score : array-like, shape (n_samples,)
        Scores where higher means more likely positive.

    Returns
    -------
    fpr : ndarray
        False positive rates (non-decreasing).
    tpr : ndarray
        True positive rates (non-decreasing).
    thresholds : ndarray
        Thresholds in descending score order, starting with +inf.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    if y_true.shape != y_score.shape:
        raise ValueError("y_true and y_score must have the same shape.")

    y_true = y_true.astype(int)
    unique = np.unique(y_true)
    if unique.size != 2 or not np.array_equal(unique, [0, 1]):
        raise ValueError(f"y_true must contain both 0 and 1 labels; got {unique}.")

    # Sort by decreasing score (stable sort => deterministic tie handling)
    order = np.argsort(-y_score, kind="mergesort")
    y_true_sorted = y_true[order]
    y_score_sorted = y_score[order]

    pos = y_true_sorted == 1
    neg = ~pos

    n_pos = pos.sum()
    n_neg = neg.sum()

    tps = np.cumsum(pos)
    fps = np.cumsum(neg)

    # ROC points only change when the threshold passes a distinct score value.
    distinct_score_indices = np.where(np.diff(y_score_sorted) != 0)[0]
    threshold_indices = np.r_[distinct_score_indices, y_true_sorted.size - 1]

    thresholds = y_score_sorted[threshold_indices]
    tpr = tps[threshold_indices] / n_pos
    fpr = fps[threshold_indices] / n_neg

    # Add the (0,0) start point at threshold = +inf
    thresholds = np.r_[np.inf, thresholds]
    tpr = np.r_[0.0, tpr]
    fpr = np.r_[0.0, fpr]

    return fpr, tpr, thresholds


def auc_trapz(x, y):
    """Area under a curve via the trapezoidal rule.

    Assumes x is sorted in non-decreasing order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1D arrays of the same length.")
    if np.any(np.diff(x) < 0):
        raise ValueError("x must be sorted in non-decreasing order.")
    return float(np.trapz(y, x))

fpr, tpr, thresholds = roc_curve_numpy(y_true, y_score)
auc_roc = auc_trapz(fpr, tpr)

print("AUC (NumPy trapezoid):", auc_roc)

try:
    from sklearn.metrics import roc_auc_score as sk_roc_auc_score, roc_curve as sk_roc_curve

    fpr_sk, tpr_sk, thr_sk = sk_roc_curve(y_true, y_score)
    auc_sk = sk_roc_auc_score(y_true, y_score)
    auc_sk_trapz = auc_trapz(fpr_sk, tpr_sk)

    print("sklearn roc_curve points:", fpr_sk.size)
    print("AUC (sklearn roc_auc_score):", auc_sk)
    print("AUC (sklearn roc_curve + trapz):", auc_sk_trapz)
    print("AUC abs diff (ours vs sklearn):", abs(auc_roc - auc_sk))
except Exception as e:
    print("sklearn check skipped ->", repr(e))
AUC (NumPy trapezoid): 0.79928
sklearn roc_curve points: 226
AUC (sklearn roc_auc_score): 0.79928
AUC (sklearn roc_curve + trapz): 0.79928
AUC abs diff (ours vs sklearn): 0.0

hover_text = [
    "t=+inf" if not np.isfinite(t) else f"t={t:.3f}" for t in thresholds
]

fig = go.Figure()
fig.add_scatter(
    x=fpr,
    y=tpr,
    mode="lines+markers",
    name=f"ROC (AUC={auc_roc:.3f})",
    text=hover_text,
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>%{text}<extra></extra>",
)
fig.add_scatter(
    x=[0, 1],
    y=[0, 1],
    mode="lines",
    name="random (AUC=0.5)",
    line=dict(dash="dash", color="gray"),
)
fig.add_scatter(
    x=[fpr_example],
    y=[tpr_example],
    mode="markers",
    name=f"example t={threshold_example:.2f}",
    marker=dict(size=11, symbol="star", color="black"),
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>example threshold<extra></extra>",
)

fig.update_layout(
    title="ROC curve: sweep threshold to trade FPR vs TPR",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    xaxis=dict(range=[0, 1], constrain="domain"),
    yaxis=dict(range=[0, 1], scaleanchor="x", scaleratio=1),
)
fig.show()

# How TPR/FPR evolve as the threshold moves (note: thresholds are in descending score order).
fig = go.Figure()
fig.add_scatter(x=thresholds[1:], y=tpr[1:], mode="lines+markers", name="TPR")
fig.add_scatter(x=thresholds[1:], y=fpr[1:], mode="lines+markers", name="FPR")
fig.add_vline(x=threshold_example, line_width=2, line_dash="dash", line_color="black")
fig.update_layout(
    title="TPR and FPR as functions of the decision threshold",
    xaxis_title="threshold t (higher ⇒ stricter)",
    yaxis_title="rate",
)
fig.update_xaxes(autorange="reversed")
fig.show()

3) AUC: a single-number summary (and what it means)#

The ROC curve is a whole family of operating points. A common summary is the Area Under the ROC Curve (AUC-ROC).

A helpful interpretation (ranking view):

\[ \mathrm{AUC} = P(s^+ > s^-) + \frac{1}{2} P(s^+ = s^-) \]

So AUC is about how well the model ranks positives above negatives (not about calibration).

AUC is convenient, but it can hide important details (for example, you might only care about very low FPR).
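
To make the ranking view concrete, here is a small sketch (using the toy data and helpers from section 2) that estimates AUC directly from the pairwise definition, and checks that a strictly increasing transform of the scores leaves AUC unchanged:

# Ranking-view check: compare all positive/negative score pairs.
s_pos = y_score[y_true == 1]
s_neg = y_score[y_true == 0]

diff = s_pos[:, None] - s_neg[None, :]  # O(n_pos * n_neg) pairs; fine for toy data
auc_rank = float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# A monotone transform (exp) preserves the ranking, hence the ROC and AUC.
fpr_w, tpr_w, _ = roc_curve_numpy(y_true, np.exp(y_score))

print("AUC (trapezoid)        :", auc_roc)
print("AUC (pairwise ranking) :", auc_rank)
print("AUC (exp-warped scores):", auc_trapz(fpr_w, tpr_w))

All three numbers agree: the pairwise statistic is the Mann-Whitney view of AUC, and warping the scores monotonically changes calibration without changing the ROC.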

4) Using ROC/AUC to tune a classifier (logistic regression from scratch)#

ROC curves need scores. Logistic regression produces a probability score \(\hat{p}(y=1\mid x)\), so it’s a natural match.

Important nuance:

  • We usually fit logistic regression by minimizing log loss (it is smooth and differentiable).

  • We often select hyperparameters (e.g., regularization strength) by maximizing a metric like AUC-ROC on a validation set.

  • We then choose an operating threshold based on business constraints, often using the ROC curve.

# Synthetic 2D binary classification dataset (logistic generative story)
n = 1200
X = rng.normal(size=(n, 2))

w_true = np.array([1.5, -2.0])
b_true = 0.2

logits = X @ w_true + b_true + rng.normal(0, 0.8, size=n)
p = 1.0 / (1.0 + np.exp(-logits))
y = rng.binomial(1, p).astype(int)

# Train/validation split
perm = rng.permutation(n)
n_train = int(0.7 * n)
train_idx = perm[:n_train]
val_idx = perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Standardize using training statistics
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12

X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma


def add_intercept(X):
    return np.c_[np.ones(X.shape[0]), X]


Xb_train = add_intercept(X_train_s)
Xb_val = add_intercept(X_val_s)

fig = px.scatter(
    x=X_val_s[:, 0],
    y=X_val_s[:, 1],
    color=y_val.astype(str),
    title="Validation split (standardized features)",
    labels={"x": "x1 (standardized)", "y": "x2 (standardized)", "color": "y"},
    opacity=0.7,
)
fig.show()

def sigmoid(z):
    """Numerically stable sigmoid."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def fit_logreg_gd(X, y, l2=0.0, lr=0.2, n_iter=2500):
    """Logistic regression via batch gradient descent (L2 on weights, not intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    w = np.zeros(X.shape[1], dtype=float)
    history = np.empty(n_iter, dtype=float)
    eps = 1e-12

    for i in range(n_iter):
        z = X @ w
        p = sigmoid(z)

        # Regularized log loss (average)
        data_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        reg_loss = 0.5 * l2 * np.sum(w[1:] ** 2)
        loss = data_loss + reg_loss

        grad = (X.T @ (p - y)) / X.shape[0]
        grad[1:] += l2 * w[1:]

        w -= lr * grad
        history[i] = loss

    return w, history

# Hyperparameter tuning: pick L2 strength that maximizes validation AUC-ROC
l2_grid = np.logspace(-4, 1, 10)

weights_by_l2 = {}
auc_by_l2 = []

for l2 in l2_grid:
    w, _ = fit_logreg_gd(Xb_train, y_train, l2=l2, lr=0.2, n_iter=2500)
    weights_by_l2[float(l2)] = w

    scores_val = sigmoid(Xb_val @ w)
    fpr_v, tpr_v, _ = roc_curve_numpy(y_val, scores_val)
    auc_v = auc_trapz(fpr_v, tpr_v)
    auc_by_l2.append(float(auc_v))

auc_by_l2 = np.array(auc_by_l2)
best_i = int(np.argmax(auc_by_l2))
best_l2 = float(l2_grid[best_i])
best_auc = float(auc_by_l2[best_i])

print(f"best l2: {best_l2:g}")
print(f"best validation AUC: {best_auc:.4f}")

fig = go.Figure()
fig.add_scatter(
    x=l2_grid,
    y=auc_by_l2,
    mode="lines+markers",
    name="validation AUC",
)
fig.add_scatter(
    x=[best_l2],
    y=[best_auc],
    mode="markers",
    marker=dict(size=12, symbol="star", color="black"),
    name="best",
)
fig.update_xaxes(type="log", title="L2 regularization strength (λ)")
fig.update_yaxes(title="Validation AUC-ROC", range=[0, 1])
fig.update_layout(title="Model selection: choose λ that maximizes AUC-ROC")
fig.show()
best l2: 0.215443
best validation AUC: 0.8933

# Refit the best model to visualize training loss
best_w, loss_hist = fit_logreg_gd(Xb_train, y_train, l2=best_l2, lr=0.2, n_iter=2500)

fig = px.line(
    y=loss_hist,
    title=f"Training loss (regularized log loss), best λ={best_l2:g}",
    labels={"x": "iteration", "y": "loss"},
)
fig.show()

# Compare ROC curves for a few regularization settings
candidates = [float(l2_grid[0]), best_l2, float(l2_grid[-1])]

fig = go.Figure()
for l2 in candidates:
    w = best_w if l2 == best_l2 else weights_by_l2[l2]
    scores_val = sigmoid(Xb_val @ w)
    fpr_v, tpr_v, _ = roc_curve_numpy(y_val, scores_val)
    auc_v = auc_trapz(fpr_v, tpr_v)
    fig.add_scatter(
        x=fpr_v,
        y=tpr_v,
        mode="lines",
        name=f"λ={l2:g} (AUC={auc_v:.3f})",
    )

fig.add_scatter(
    x=[0, 1],
    y=[0, 1],
    mode="lines",
    name="random (AUC=0.5)",
    line=dict(dash="dash", color="gray"),
)
fig.update_layout(
    title="ROC curves on validation set (different regularization strengths)",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    xaxis=dict(range=[0, 1], constrain="domain"),
    yaxis=dict(range=[0, 1], scaleanchor="x", scaleratio=1),
)
fig.show()

# Choosing an operating threshold from the ROC curve
# Example constraint: keep FPR <= 5% while maximizing TPR.
scores_val = sigmoid(Xb_val @ best_w)
fpr_v, tpr_v, thr_v = roc_curve_numpy(y_val, scores_val)

target_fpr = 0.05
feasible = np.where(fpr_v <= target_fpr)[0]

chosen_i = int(feasible[np.argmax(tpr_v[feasible])]) if feasible.size else 0
chosen_thr = float(thr_v[chosen_i])

tp, fp, tn, fn = confusion_counts(y_val, scores_val, chosen_thr)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"target FPR <= {target_fpr:.2f}")
print(f"chosen threshold: {chosen_thr:.4f}")
print(f"FPR={fpr_v[chosen_i]:.4f}, TPR={tpr_v[chosen_i]:.4f}")
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"precision={precision:.4f}, recall={recall:.4f}")

default_thr = 0.5
tp2, fp2, tn2, fn2 = confusion_counts(y_val, scores_val, default_thr)
tpr2 = tp2 / (tp2 + fn2)
fpr2 = fp2 / (fp2 + tn2)

fig = go.Figure()
fig.add_scatter(
    x=fpr_v,
    y=tpr_v,
    mode="lines",
    name="ROC (best model)",
)
fig.add_scatter(
    x=[fpr_v[chosen_i]],
    y=[tpr_v[chosen_i]],
    mode="markers",
    marker=dict(size=12, symbol="star", color="black"),
    name=f"chosen (FPR≤{target_fpr:.2f})",
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>chosen threshold<extra></extra>",
)
fig.add_scatter(
    x=[fpr2],
    y=[tpr2],
    mode="markers",
    marker=dict(size=10, symbol="circle", color="gray"),
    name="default t=0.5",
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>t=0.5<extra></extra>",
)
fig.add_vline(x=target_fpr, line_width=1, line_dash="dash", line_color="gray")
fig.update_layout(
    title="Picking a threshold from ROC constraints",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    xaxis=dict(range=[0, 1], constrain="domain"),
    yaxis=dict(range=[0, 1], scaleanchor="x", scaleratio=1),
)
fig.show()
target FPR <= 0.05
chosen threshold: 0.6374
FPR=0.0462, TPR=0.5134
TP=96, FP=8, TN=165, FN=91
precision=0.9231, recall=0.5134

Pros, cons, and when ROC is a good choice#

Pros#

  • Threshold-free view: shows the entire tradeoff curve instead of committing to one \(t\).

  • Ranking-focused: works naturally with scores (and AUC has a clean ranking interpretation).

  • Model comparison: curves make it easy to compare classifiers across operating points.

  • Less sensitive to class imbalance than accuracy (it uses rates, not raw counts).

Cons / pitfalls#

  • Can look overly optimistic on very imbalanced problems: a small FPR can still mean many false positives in absolute count (see the sketch after this list).

  • AUC can hide the region you actually care about (e.g., only FPR < 1%). Consider partial AUC or zooming (also sketched below).

  • Not about calibration: a perfectly calibrated model and a poorly calibrated model can have the same ROC/AUC (see the monotone-transform check in section 3).

  • Needs scores: if you pass hard labels, you’ll get only a couple of ROC points (also sketched below).
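
Three of these pitfalls in code (a sketch reusing y_true, y_score, fpr, tpr, and roc_curve_numpy from section 2; partial_auc is a local helper defined here, and scikit-learn's roc_auc_score also accepts a max_fpr argument for a standardized partial AUC):

# (1) Imbalance: rates hide absolute counts. With 1,000,000 negatives,
# an innocuous-looking FPR of 1% is already 10,000 false alarms.
print("false positives at FPR=1% with 1e6 negatives:", int(0.01 * 1_000_000))

# (2) Partial AUC: restrict the area to the low-FPR region you care about.
def partial_auc(fpr, tpr, max_fpr=0.05):
    """Unnormalized area under the ROC for FPR <= max_fpr.

    Conservative: drops the segment that crosses the cutoff.
    """
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    keep = fpr <= max_fpr
    return float(np.trapz(tpr[keep], fpr[keep]))

print("partial AUC (FPR <= 0.05):", partial_auc(fpr, tpr, max_fpr=0.05))

# (3) Hard labels collapse the ROC to a handful of points.
hard_labels = (y_score >= 0.5).astype(float)
fpr_h, tpr_h, _ = roc_curve_numpy(y_true, hard_labels)
print("ROC points from scores:", fpr.size, "| from hard labels:", fpr_h.size)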

Good use cases#

  • You care about ranking (retrieval, screening, triage) and want a threshold later.

  • You have a constraint like “FPR must be below X” and want the best achievable TPR.

  • Comparing multiple models when operating conditions or cost ratios are not yet fixed.

Exercises#

  1. Implement a sample-weighted ROC curve (each point has weight \(w_i\)).

  2. Show a case where two models have similar AUC but very different performance for FPR < 1%.

  3. Create an extremely imbalanced dataset and compare ROC vs precision-recall curves (why might PR be more informative?).

  4. Derive the ranking interpretation of AUC from scratch.

References#

  • scikit-learn: roc_curve: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

  • scikit-learn: roc_auc_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

  • Tom Fawcett (2006), An introduction to ROC analysis: https://doi.org/10.1016/j.patrec.2005.10.010

  • Wikipedia: Receiver operating characteristic: https://en.wikipedia.org/wiki/Receiver_operating_characteristic